# connect to database
mydb <- dbConnect(
  RSQLite::SQLite(),
  "~/repos/airwars_scraping_project/database/airwars_db.sqlite")
dbListTables(mydb) # print tables in database

[1] "airwars_incidents" "airwars_meta" "daily_casualties"
___________________________________________________________________________
This project was created on a Lenovo Legion 7i laptop with an i9-14900HX chip, 64 GB of DDR5 RAM, and an Nvidia RTX 4070 GPU with 8 GB of GDDR6. The operating system initially used was Ubuntu 24.10. However, when we attempted to configure the GPU to process text data for the language model, we learned that this version of Ubuntu ships the newest kernel, which updates nvidia-cli and the CUDA drivers to versions that are not compatible with the TensorFlow or PyTorch builds needed for the text package. We moved to Ubuntu 24.04 LTS within WSL2 on Windows 11 and were able to configure the GPU there.

Because Airwars goes back and corrects archived incidents, it is easier to rerun the full process on all available incident records when needed, and the GPU keeps that affordable: running the language model on the GPU cut the processing time from 2.5 hours to about 20 minutes.
We use r-base 4.4.1 from Anaconda and RStudio 2024.04.02. We have also used these same packages with RStudio Server via WSL2, but we prefer to isolate the computing environment.
Table 1 (airwars_meta) contains incident metadata (e.g., unique id, incident date, web-page URL).
Table 2 (airwars_incidents) stores the incident-level information: the number of deaths, the breakdown of deaths (children, adults), the type of attack, the cause of death, the incident coordinates and the results returned by Nominatim, and sentiment scores for seven emotional states.
Table 3 (daily_casualties) contains the Hamas Ministry of Health (MoH) daily casualties.
The first two tables relate to each other through the unique incident identification numbers provided by Airwars. We relate the MoH table to the Airwars tables by aggregating the incidents up to the daily level and matching on date.
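The relationship between the three tables can be sketched as follows. This is a minimal illustration on toy data, not the project's actual pipeline: the tibbles `meta`, `incidents`, and `moh`, the ids `a1` to `a3`, and the output column names `airwars_killed` and `moh_total` are all hypothetical, and only the column names `Incident_id`, `Incident_Date`, `killed`, `name`, and `value` are taken from the tables shown in this document.

```r
library(dplyr)
library(tibble)

# Toy stand-ins for the three tables; ids and counts are invented for illustration
meta <- tibble(
  Incident_id   = c("a1", "a2", "a3"),
  Incident_Date = as.Date(c("2023-10-07", "2023-10-07", "2023-10-08"))
)
incidents <- tibble(
  Incident_id = c("a1", "a2", "a3"),
  killed      = c(1, 5, 15)
)
moh <- tibble(
  Incident_Date = as.Date(c("2023-10-07", "2023-10-08")),
  name          = c("Total", "Total"),
  value         = c(232, 370)
)

# Step 1: attach the incident date to each incident via the shared id
# Step 2: aggregate the incident-level counts up to one row per date
airwars_daily <- meta |>
  inner_join(incidents, by = "Incident_id") |>
  group_by(Incident_Date) |>
  summarise(airwars_killed = sum(killed, na.rm = TRUE), .groups = "drop")

# Step 3: join the daily Airwars totals to the MoH daily totals by date
daily <- airwars_daily |>
  left_join(filter(moh, name == "Total"), by = "Incident_Date") |>
  select(Incident_Date, airwars_killed, moh_total = value)

print(daily)
```

Aggregating before the second join keeps one row per date on both sides, so the MoH totals are not duplicated across incidents that share a date.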
# read in data tables
airwars_meta <- tbl(mydb, "airwars_meta") |>
  as_tibble() |>
  # convert Incident_Date to date format
  mutate(Incident_Date = as_date(Incident_Date)) |>
  arrange(Incident_Date)
airwars_meta |> head() |> kable()

| Incident_Date | Incident_id | link |
|---|---|---|
| 2023-10-07 | ispt0019a | https://airwars.org/civilian-casualties/ispt0019a-october-7-2023/ |
| 2023-10-07 | ispt0019 | https://airwars.org/civilian-casualties/ispt0019-october-7-2023/ |
| 2023-10-07 | ispt0017 | https://airwars.org/civilian-casualties/ispt0017-october-7-2023/ |
| 2023-10-07 | ispt0011 | https://airwars.org/civilian-casualties/ispt0011-october-7-2023/ |
| 2023-10-07 | ispt0010 | https://airwars.org/civilian-casualties/ispt0010-october-7-2023/ |
| 2023-10-07 | ispt0003 | https://airwars.org/civilian-casualties/ispt0003-october-7-2023/ |
airwars_incidents <- tbl(mydb, "airwars_incidents") |>
  as_tibble()
airwars_incidents |>
  head() |>
  select(-assessment:-surprise) |>
  kable()

| Incident_id | Strike status | Strike type | Civilian infrastructure | Civilian harm reported | Civilians reported killed | Civilians reported injured | Cause of injury / death | Airwars civilian harm grading | Impact | Suspected belligerent | min_killed | max_killed | casualty_estimate | killed | location_meta | incident_lat | incident_long | target_type | target_address_type | lat_min | lat_max | long_min | long_max | Suspected belligerents | Known belligerent | Suspected target | Known target | Causes of injury / death | children_killed | women_killed | men_killed | Civilian_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ispt0001 | Single source claim | Airstrike and/or Artillery | Healthcare facility | Yes | 1 | 2 | Heavy weapons and explosive munitions | Fair | Healthcare | Israeli Military | 1 | 1 | absolute | 1 | Rafah the Gaza Strip | 31.296628 | 34.244689 | tertiary | road | 31.29471 | 31.29727 | 34.24404 | 34.24777 | NA | NA | NA | NA | NA | NA | NA | 1 | (1 man1 healthcare_personnel) |
| ispt00012 | Likely strike | Airstrike | Healthcare facility | Yes | 1 | 4 | Heavy weapons and explosive munitions | Fair | Healthcare | Israeli Military | 1 | 1 | absolute | 1 | Nasser medical complex Khan Younis the Gaza Strip | 31.347002 | 34.292327 | hospital | amenity | 31.34554 | 31.34852 | 34.29053 | 34.29406 | NA | NA | NA | NA | NA | NA | NA | 1 | (1 man1 healthcare_personnel) |
| ispt00013 | Likely strike | Airstrike | Healthcare facility | Yes | Unknown | 5 | Heavy weapons and explosive munitions | Fair | Healthcare | Israeli Military | NA | NA | NA | NA | Nasser medical complex Khan Younis the Gaza Strip | 31.347153 | 34.292774 | hospital | amenity | 31.34554 | 31.34852 | 34.29053 | 34.29406 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| ispt0002 | Likely strike | Airstrike | Residential building | Yes | 3 – 7 | 4 | Heavy weapons and explosive munitions | Fair | NA | Israeli Military | 3 | 7 | range | 5 | Khuza a Khan Younis the Gaza Strip | 31.522567 | 34.462869 | tertiary | road | 31.52175 | 31.53208 | 34.46157 | 34.47451 | NA | NA | NA | NA | NA | 2 | 3 | 1 | (2 children3 women1 man) |
| ispt0003 | Likely strike | Airstrike | NA | Yes | 1 | 3 | Heavy weapons and explosive munitions | Fair | NA | Israeli Military | 1 | 1 | absolute | 1 | Al Baraa Mosque Gaza the Gaza Strip | 31.501846 | 34.437058 | place_of_worship | amenity | 31.50165 | 31.50191 | 34.43679 | 34.43723 | NA | NA | NA | NA | NA | NA | NA | 1 | (1 man) |
| ispt0004 | Likely strike | Airstrike | Residential building | Yes | 15 | NA | Heavy weapons and explosive munitions | Fair | NA | Israeli Military | 15 | 15 | absolute | 15 | home of the Al Dous family in Al Zaytoun neighborhood south of Gaza City Gaza the Gaza Strip | 31.485479 | 34.444885 | residential | road | 31.48442 | 31.48662 | 34.44390 | 34.44632 | NA | NA | NA | NA | NA | 7 | 3 | 5 | (7 children3 women5 men) |
tbl(mydb, "daily_casualties") |>
  as_tibble() |>
  mutate(Incident_Date = lubridate::as_date(Incident_Date)) |>
  head() |>
  kable()

| Incident_Date | name | value |
|---|---|---|
| 2023-10-07 | Children | 0 |
| 2023-10-07 | Women | 0 |
| 2023-10-07 | Total | 232 |
| 2023-10-08 | Children | 78 |
| 2023-10-08 | Women | 41 |
| 2023-10-08 | Total | 370 |
When possible, Airwars includes the coordinates of where the incident took place. Although this information is contained within the assessment, Airwars standardizes its location reporting under a “Geolocation notes” heading, from which we were able to parse the latitude and longitude for geographic plotting. Of the 804 incidents, about 65% contain geographic coordinates.
airwars_incidents |>
  select(target_type, contains("lat"), contains("long")) |>
  head() |>
  kable()

| target_type | incident_lat | lat_min | lat_max | incident_long | long_min | long_max |
|---|---|---|---|---|---|---|
| tertiary | 31.296628 | 31.29471 | 31.29727 | 34.244689 | 34.24404 | 34.24777 |
| hospital | 31.347002 | 31.34554 | 31.34852 | 34.292327 | 34.29053 | 34.29406 |
| hospital | 31.347153 | 31.34554 | 31.34852 | 34.292774 | 34.29053 | 34.29406 |
| tertiary | 31.522567 | 31.52175 | 31.53208 | 34.462869 | 34.46157 | 34.47451 |
| place_of_worship | 31.501846 | 31.50165 | 31.50191 | 34.437058 | 34.43679 | 34.43723 |
| residential | 31.485479 | 31.48442 | 31.48662 | 34.444885 | 34.44390 | 34.44632 |
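
Parsing a coordinate pair out of free text can be sketched with base R regular expressions. The note wording below is hypothetical (the real "Geolocation notes" text on Airwars pages may be phrased differently), and the regular expression is an assumption about the format rather than the parser used in this project. Only the column names `incident_lat` and `incident_long` come from the table above.

```r
# Hypothetical note text; the real wording on Airwars pages may differ
notes <- c(
  "Reports place the strike near the hospital, at these coordinates: 31.347002, 34.292327.",
  "No precise location could be determined."
)

# A signed decimal number, a comma, optional whitespace, another signed decimal number
pattern <- "(-?[0-9]+\\.[0-9]+),[[:space:]]*(-?[0-9]+\\.[0-9]+)"

# regexec() records the full match plus each capture group;
# regmatches() returns character(0) for notes without a match
hits <- regmatches(notes, regexec(pattern, notes))

# Pull capture group i (latitude = 1, longitude = 2) or NA when nothing matched
pick <- function(h, i) if (length(h) == 3) as.numeric(h[i + 1]) else NA_real_

coords <- data.frame(
  incident_lat  = vapply(hits, pick, numeric(1), i = 1),
  incident_long = vapply(hits, pick, numeric(1), i = 2)
)

# Share of notes that yielded a coordinate pair
share_geocoded <- mean(!is.na(coords$incident_lat))
print(coords)
print(share_geocoded)
```

Returning `NA` for notes without a match keeps one row per incident, so the share of geocoded incidents falls out of a simple `mean(!is.na(...))`.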
After trying several text classification models and some question/context models, we landed on j-hartmann/emotion-english-distilroberta-base because it goes beyond a simple positive/negative evaluation and analyzes text for Ekman’s 6 basic emotions, plus a neutral class, a framework that is common in psychological work on emotions. Moreover, this model lets us examine the emotional tone of these assessments over time. [2]
We get a score for each emotion: the closer a score is to 1, the stronger the association, and the scores for a given assessment add up to 1.
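
Scores that are each between 0 and 1 and sum to 1 are what a softmax over the seven classes would produce. The sketch below assumes that this is how the raw model outputs are normalised. The logits are invented for illustration and are not taken from the model.

```r
# Softmax: exponentiate the logits and normalise so the scores sum to 1
softmax <- function(logits) {
  z <- exp(logits - max(logits))  # subtracting the max avoids overflow
  z / sum(z)
}

labels <- c("anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise")

# Hypothetical logits for one assessment; not real model output
logits <- c(1.2, -2.0, 2.5, -3.1, -1.5, 0.6, -0.9)

scores <- setNames(softmax(logits), labels)
print(round(scores, 2))

total    <- sum(scores)              # 1, up to floating-point error
dominant <- names(which.max(scores)) # label with the largest score
print(dominant)
```

Because the scores share a fixed total, a high score on one emotion necessarily lowers the others, which is worth keeping in mind when comparing emotions within a single assessment.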
According to the model card, the model is trained on a balanced subset of its source datasets (2,811 observations per emotion, i.e., nearly 20k observations in total). 80% of this balanced subset is used for training and 20% for evaluation. The evaluation accuracy is 66% (vs. the random-chance baseline of 1/7 = 14%).
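
The figures quoted above can be checked with a few lines of arithmetic. The split sizes are approximate, since the exact rounding used for the 80/20 split is not stated, and the variable names are illustrative only.

```r
n_emotions  <- 7      # six Ekman emotions plus neutral
n_per_class <- 2811   # observations per emotion in the balanced subset

n_total <- n_emotions * n_per_class  # 19677, i.e. "nearly 20k"
n_train <- 0.8 * n_total             # roughly 15.7k observations for training
n_eval  <- 0.2 * n_total             # roughly 3.9k observations for evaluation

baseline <- 1 / n_emotions           # about 0.14, the random-chance accuracy
lift     <- 0.66 / baseline          # about 4.6 times better than chance

print(c(total = n_total, train = n_train, eval = n_eval))
print(round(c(baseline = baseline, lift = lift), 2))
```

An accuracy of 66% therefore still means roughly one in three texts receives a label other than the annotated one, so individual scores are best read as indicative rather than exact.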
Given that we have over 800 assessments, we decided to use the text package [3] because it lets us use a laptop GPU (RTX 4070) [4] to run the model on each incident. This resulted in large gains in processing time.
Below we print an example of these scores, with the assessment text truncated.
airwars_incidents |>
  slice_sample(n = 1) |>
  select(assessment:surprise) |>
  mutate(assessment = str_trunc(assessment, 200),
         across(where(is.double), ~ round(.x, 2))) |>
  kable()

| assessment | anger | disgust | fear | joy | neutral | sadness | surprise |
|---|---|---|---|---|---|---|---|
| On Tuesday, October 29th 2023, at least 21 members members of the Thabet family, including four women and ten children, were killed and several civilians were injured in an alleged Israeli airstrik… | 0.18 | 0 | 0.68 | 0 | 0 | 0.1 | 0.02 |
Note. Confidence is low to moderate since the data comes from the Hamas MoH.
[3] An R package for analyzing natural language with transformers from HuggingFace using Natural Language Processing and Machine Learning.
[4] Installing text is tricky because the right Python libraries must be installed. To run models on the GPU, we learned that the Nvidia CUDA drivers must be installed for version 12.1. Additionally, we could only get this to work via Anaconda within Ubuntu 24.04 installed through WSL2 on Windows 11. Ubuntu 24.10 comes with a kernel that forces CUDA 12.8 to be installed, and it did not work for us in a dual-boot system.